Evaluating Answers to Definition Questions
Author
Abstract
This paper describes an initial evaluation of systems that answer questions seeking definitions. The results suggest that humans agree sufficiently on the basic concepts that should be included in the definition of a particular subject to permit the computation of concept recall. Computing concept precision is more problematic, however. Using the length in characters of a definition is a crude approximation to concept precision, but one that is nonetheless sufficient to correlate with humans' subjective assessments of definition quality.

The TREC question answering track has sponsored a series of evaluations of systems' abilities to answer closed-class questions in many domains (Voorhees, 2001). Closed-class questions are fact-based, short-answer questions. The evaluation of QA systems for closed-class questions is relatively simple because a response to such a question can be meaningfully judged on a binary scale of right/wrong. Increasing the complexity of the question type even slightly significantly increases the difficulty of the evaluation, because partial credit for responses must then be accommodated.

The ARDA AQUAINT program (see http://www.ic-arda.org/InfoExploit/aquaint/index.html) is a research initiative sponsored by the U.S. Department of Defense aimed at increasing the kinds and difficulty of the questions automatic systems can answer. A series of pilot evaluations has been planned as part of the research agenda of the AQUAINT program. The purpose of each pilot is to develop an effective evaluation methodology for systems that answer a certain kind of question. One of the first pilots to be implemented was the Definitions Pilot, whose goal was to develop an evaluation methodology for questions such as What is mold? and Who is Colin Powell?

This paper presents the results of the pilot evaluation. The pilot demonstrated that human assessors generally agree on the concepts that should appear in the definition of a particular subject, and can find those concepts in the systems' responses. Such judgments support the computation of concept recall, but not concept precision, since it is not feasible to enumerate all concepts contained within a system response. Instead, the length of a response is used to approximate concept precision. An F-measure score combining concept recall and length is used as the final metric for a response. Systems ranked by average F score correlate well with assessors' subjective opinions of definition quality.
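The abstract does not spell out the scoring formula itself. The following is a sketch consistent with the description above, in the style of TREC definition scoring; the 100-character allowance per matched concept and the recall-heavy weight beta = 5 are assumptions, not values stated in this abstract.

% Sketch only: the allowance constant and beta are assumed, not quoted from the abstract.
\begin{align*}
R &= \frac{\#\{\text{answer-key concepts found in the response}\}}{\#\{\text{answer-key concepts}\}}\\[4pt]
\mathrm{allowance} &= 100 \times \#\{\text{concepts matched}\}\\[4pt]
P &= \begin{cases}
  1 & \text{if } \mathrm{length} \le \mathrm{allowance}\\[2pt]
  1 - \dfrac{\mathrm{length} - \mathrm{allowance}}{\mathrm{length}} & \text{otherwise}
\end{cases}\\[4pt]
F_\beta &= \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}, \qquad \beta = 5
\end{align*}

With a large beta the measure is dominated by concept recall; response length enters only through the precision proxy P, matching the abstract's claim that length is a crude but workable stand-in for concept precision.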
Similar Resources
Evaluating EFL Learners’ Philosophical Mentality through their Answers to Philosophical Questions: Using Smith’s Framework
Given the role philosophical mentality can fulfill in equipping individuals with the essential skills of wisdom and sound thinking, the present paper, applying Smith's (2007) theoretical framework, strived to explore the extent to which philosophic-mindedness exists among the participants. Considering that a philosophic mind begets philosophical answers, the participants' philosophical thi...
Automatically Evaluating Answers to Definition Questions
Following recent developments in the automatic evaluation of machine translation and document summarization, we present a similar approach, implemented in a measure called POURPRE, for automatically evaluating answers to definition questions. Until now, the only way to assess the correctness of answers to such questions involves manual determination of whether an information nugget appears in a... (An illustrative sketch of this style of automatic nugget matching appears after this list.)
Qualitative Dimensions in Question Answering: Extending the Definitional QA Task
Current question answering tasks handle definitional questions by seeking answers which are factual in nature. While factual answers are a very important component in defining entities, a wealth of qualitative data is often ignored. In this incipient work, we define qualitative dimensions (credibility, sentiment, contradictions etc.) for evaluating answers to definitional questions and we explo...
LAMP-TR-119, CS-TR-4695, UMIACS-TR-2005-04 (February 2005): Automatically Evaluating Answers to Definition Questions
Following recent developments in the automatic evaluation of machine translation and document summarization, we present a similar approach, implemented in a measure called Pourpre, for automatically evaluating answers to definition questions. Until now, the only way to assess the correctness of answers to such questions involves manual determination of whether an information nugget appears in a...
Processing Definition Questions in an Open-Domain Question Answering System
This paper presents a hybrid method for finding answers to Definition questions within large text collections. Because candidate answers to Definition questions do not generally fall in clearly defined semantic categories, answer discovery is guided by a combination of pattern matching and WordNet-based question expansion. The method is incorporated in a large open-domain question answering syst... (An illustrative sketch of such surface patterns appears after this list.)
Finding Answers to Definition Questions across the Spanish Web
In the last two years, we have been developing a multilingual web question answering system, which is aimed at discovering answers to natural language questions in three different languages: English, German and Spanish. One of its major components is the module that extracts answers to definition questions from the web. This paper compares and provides insights into different techniques that we...
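The two POURPRE entries above describe replacing manual nugget judgments with automatic matching against reference nuggets. Below is a minimal, hypothetical sketch of that idea using simple unigram overlap; the function names, tokenization, and averaging are illustrative assumptions, not the published POURPRE implementation.

import re

def _terms(text: str) -> set[str]:
    # Lowercase and keep alphanumeric tokens only; a crude but common normalization.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def unigram_overlap(nugget: str, response: str) -> float:
    """Fraction of the nugget's word types that also appear in the response."""
    nugget_terms = _terms(nugget)
    return len(nugget_terms & _terms(response)) / len(nugget_terms) if nugget_terms else 0.0

def soft_concept_recall(nuggets: list[str], response: str) -> float:
    """Average per-nugget overlap: a judgment-free analogue of concept recall."""
    return sum(unigram_overlap(n, response) for n in nuggets) / len(nuggets) if nuggets else 0.0

# Example: two reference nuggets for "What is mold?" scored against one response.
nuggets = ["fungus that grows in damp conditions", "can trigger allergies"]
response = "Mold is a fungus that thrives in damp indoor conditions and can trigger allergies."
print(soft_concept_recall(nuggets, response))

Replacing the binary "assessor found the nugget" judgment with a graded overlap score is what makes the evaluation automatic and repeatable, at the cost of missing paraphrases that share no vocabulary with the reference nugget.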
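The entry on processing Definition questions above mentions finding candidate answers with surface patterns plus WordNet-based expansion. A hypothetical illustration of the surface-pattern half follows; these two patterns are made-up examples, not the cited system's actual patterns, and the WordNet component is not shown.

import re

def definition_patterns(term: str) -> list[re.Pattern]:
    # Two classic surface patterns for definitions: a copula pattern
    # ("X is a/an/the ...") and an appositive pattern ("X, a/an/the ...").
    t = re.escape(term)
    return [
        re.compile(rf"\b{t}\s+is\s+(?:a|an|the)\s+([^.;]+)", re.IGNORECASE),
        re.compile(rf"\b{t}\s*,\s*(?:a|an|the)\s+([^,;.]+)", re.IGNORECASE),
    ]

def extract_candidates(term: str, text: str) -> list[str]:
    """Collect candidate definition phrases for `term` from raw text."""
    return [m.group(1).strip() for p in definition_patterns(term) for m in p.finditer(text)]

# Example:
text = "Mold, a fungus found in damp places, can damage homes. Mold is a common allergen."
print(extract_candidates("mold", text))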
Publication date: 2003